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Abstract 

We consider deployment of the particle filter on modern massively parallel hardware architectures, such 
as Graphics Processing Units (GPUs), with a focus on the resampling stage. While standard multinomial and 
stratified resamplers require a sum of importance weights computed collectively between threads, a Metropolis 
resampler favourably requires only pair-wise ratios between weights, computed independently by threads, and 
can be further tuned for performance by adjusting its number of iterations. While achieving respectable results 
for the stratified and multinomial resamplers, we demonstrate that a Metropolis resampler can be faster where 
the variance in importance weights is modest, and so is worth considering in a performance-critical context, such 
as particle Markov chain Monte Carlo and real-time applications. 

1 Introduction 

This work considers the adaptation of particle filtering ||6l HJ to modern graphics processing units (GPUs). Such 
devices are typical of a trend away from faster clock speeds toward expanding parallelism to maximise throughput 
at the processor level. Capitalising on this wider concurrency for a given application is rarely trivial, requiring 
adaptation of current best-practice sequential algorithms, development of new algorithms, or even revival of old 
ideas since bested in a serial context. This work makes such an attempt for the particle filter, with a focus on 
resampling strategies. 

Our motivation arises from Bayesian inference in large state-space models, in particular those of marine bio- 
geochemistry |'5' '9l|. The broad methodology is that of particle Markov chain Monte Carlo (PMCMC) 1 1 1, a meld 
of MCMC over parameters with sequential Monte Carlo (SMC) over state. For Metropolis-Hastings in parameter 
space, the particle filter is a candidate for computing the marginal likelihood (over the state) of each newly pro- 
posed parameter configuration. This requires iteration of the particle filter, so the use of GPUs to reduce overall 
runtime is attractive. Beyond these applications, the material here may be of relevance to GPU adaptation of 
bootstrap methods, or in the presence of hard runtime constraints as in real-time applications. 

For some sequence of time points t = 1, . . . ,T and observations at those times yi, . . . , y^-, the particle filter 
estimates the time marginals of a latent state, j3(Xf |yi:(), proceeding recursively as: 

1. (Initialisation) At < = 0, draw P samples xj, . . . , x^ ^ p(Xo). 

2. (Propagation) For t = 1, . . . ,T, i = 1, . . . , P and some proposal distribution (/(Xf |X(_i), draw xj ~ 

3. (Weighting) Weight each particle with wl = P(y*l^t)p(^il^t--i)p(^t-ilyi:t-i) ^ ^j^^^ ^j^^ weighted sample set 
{X(, uij} represents the filter density p(X(|yi.4). 

The basic particle filter suffers from degeneracy - the tendency, after several iterations, to heap all weight on 
a single particle. The usual strategy around this is to introduce a further step: 

4. (Resampling) Redraw, with replacement, P samples from the weighted sample set {xj, wl}, using weights 
as unnormalised probabilities, and assign a new weight of 1/P to each. 
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Figure 1: Visualisation of popular resampling strategies. Arcs along the perimeter of the circles represent parti- 
cles by weight, arrows indicate selected particles and are positioned (a) uniformly randomly in the multinomial 
resampler, (b & c) by evenly slicing the circle into strata and randomly selecting an offset (stratified resampler) 
or using the same offset (systematic resampler) into each stratum, or (d) initialising multiple Markov chains and 
simulating to convergence in the Metropolis resampler. 



Code 1 Pseudocode for the multinomial, systematic and Metropolis resamplers. 



Multinomial-Resampler 

1 sort weights (optional) 

2 exclusive prefix sum lui , . . . , mp to obtain Wi , . . . , Wp . 

3 for j = 1, . . . , P [> or perform as vector binary search 

4 do Uir~.U[0,Wp + wp) 

5 binary search through Wi, . . . , Wp to find j such that Wj <Ui< Wj+i 

6 Oi i 

Systematic-Resampler 

1 sort weights (optional) 

2 ti~W[0, 1) 

3 W'o ^ 

4 inclusive prefix sum wi, ... ,wp to obtain Wi , . . . , Wp 

5 for i = 1, . . . , P [> one per thread on GPU 

6 do ^[g^-u\ + i 

7 adjacent difference Oi Op to obtain oi, . . . ,op 



Initialisation, propagation and weighting are very readily parallelised, noting that these are independent op- 
erations on each particle. Resampling, on the other hand, may require synchronisation and collective operations 
across particles, particularly to sum weights for normalisation. The focus of this work is on this resampling stage. 

Parallel implementation of resampling schemes has been considered before, largely in the context of distributed 
memory parallelism [T, T|. Likewise, implementation of the particle filter on GPUs has been considered |12J|7j, 
although with an emphasis on low-level implementation issues rather than potential algorithmic adaptations as 
here. 

2 Resampling algorithms 

For each particle i at time t, consider the resampling algorithm to be the means by which its number of offspring, 
Oi, for propagation to i + 1 is selected, or alternatively the means by which its ancestor, ai, from time i — 1 is 
selected. Figure [T]describes a number of popular resampling schemes, while Code[T]provides pseudocode. 

The stratified and systematic |10| resamplers output offspring, while the multinomial and Metropolis resam- 
plers output ancestors. Converting between the two is straightforward. The multinomial, stratified and systematic 
resamplers require a collective prefix-sum of weights. While this can be performed quite efficiently on GPUs 1 14], 
it is not ideal, requiring thread communication and multiple kernel launches. The Metropolis resampler requires 



Metropolis-Res AMPLER(B) 
1 for i = 1, . . . , P > one per thread on GPU 



2 do p i 

3 for n — I, . . . , B 

4 do u ~ W[0, 1) 

5 g~W{l,...,P} 

6 if u < Wq/wp 
1 then p ■(— q 
8 ai p 
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only ratios between weights, so that threads can operate independently in a single kernel launch; this is more 
suited to GPU architectures. 

The Metropolis resampler is parameterised by B, the number of iterations to be performed for each particle 
before settling on its chosen ancestor. Selecting _B is a tradeoff between speed and reliability: while runtime 
reduces with fewer steps, the sample will be biased if B is too small to ensure convergence. Bias may not be such 
a problem for tracking applications where performance is judged by outcomes, but will violate assumptions that 
lead to unbiased state estimates in a particle MCMC framework 

Convergence depends largely on the particle of greatest weight, Pmux, being exposed by a sufficient number 
of proposals that the probability of it being returned by any one chain approaches lUmax = max; / . 
Following Raftery and Lewis |13|[§2.1], construct a binary 0-1 process Z„ = (5(C/„ ~ Pmax) over the sequence 
Un generated by a single chain of the Metropolis resampler It seems sensible now to require that P{Zb = 1 1 Zq) 
be within some e of Wmax- Zn is a Markov chain with transition matrix given by; 

^=('ri-%)- <" 

where a is the probability of transitioning from 1 to 0, and f3 from to 1 . As a uniform proposal across all particle 
indices is used (Metropolis-Resampler, line|5]l, the chance of selecting p^nax is ^/P, being of greatest weight 
this will always be accepted through the Metropolis criterion (line|6]l, and so (3 = I /P. For a, we have: 

. t ■ , \^ '^max / ^ "-'max 

The /-step transition matrix is then: 
where A = (1 — a — (3), and we require; 

A-<^^, (4) 

satisfied when; 

log A 

In practice one may select P and an upper tolerance for Wmax based on expected weight variances given the quality 
of q{-) proposals, and compute an appropriate B as above to use throughout the filter. 



3 Experiments 

Weight sets are simulated to assess the speed and accuracy of each resampling algorithm. Assume that weights 
are approximately Dirichlet distributed, w ^ Dir(Q;), with ai ^ a2 = ■ ■ ■ = ap = a. Data sets are simulated 
for all combinations of P = 256, 512, . . . , 65536 and a = 10, 1, .1, .01. 

Experiments are conducted on an NVIDIA Tesla S2050 using double precision, CUDA 3.1, gcc 4.3 and all 
compiler optimisations. Implementation of the Metropolis resampler uses a custom kernel, with random numbers 
provided by the Tausworthe iflTl generator of the Thrust library |8|. The multinomial and systematic resamplers 
use vector operations of Thrust. To remove the overhead of dynamic memory allocations, temporaries allocated 
by Thrust have been replaced with pooled memory, reusing previously allocated arrays. Figure |2] gives both the 
accuracy and runtime of the multinomial, systematic and Metropolis resamplers across P and a. 

The Metropolis resampler converges to a multinomial resampling as the number of steps increases, such that 
the error in outcomes should match, but can never beat, that of the multinomial resampler. Figure |2] shows that 
the Metropolis resampler has converged in all cases, so certainly the estimates of B from our analysis appear 
reliable. The stratified resampler is known to minimise the variance, and thus error, of resampling outcomes. 
The results here confirm this expectation. The multinomial resampler performs slightly faster than the stratified 
resampler with sorting enabled, but slightly slower when not. This is due to the binary search proceeding faster 
when weights are sorted. Certainly any thought of pre-sorting to hasten the binary search seems futile - the saving 
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Figure 2: Experimental results over 1000 resamplings: (top row) mean error, p ~ ^"''^ Y, and (bottom 

row) runtime across various numbers of particles, P, with Dirichlet a parameter of (left to right) 10, 1, .1 and 
.01. 



is much less than the additional overhead of the sort on the GPU. Judging by the lack of scaling over P, runtime of 
the systematic and sorted multinomial resamplers appears dominated by the overhead of multiple kernel launches. 

At a of . 1 and .01, the unsorted systematic resampler beats the others in runtime as well as error. At a of 10 and 
1, the Metropolis resampler is fastest for P < 4096; in all other cases B must be too large to be competitive. Note 
that the modest variance in weights at these a means that the fairest comparison is against the unsorted approaches, 
as pre-sorting is unnecessary for numerical accuracy. At a = 1, effective sample size (ESS) is approximately .5P, 
making this quite a realistic scenario. At a = .1, ESS is about .IP, and at a = .01 about .OIP: more severe 
cases. 

Performance of the Metropolis resampler hinges almost entirely on the rate at which random numbers can be 
generated, and this should be the first focus for further runtime gains. It is possible to reduce B, proportionately 
reducing runtime, but care should be taken that resampling results are not biased as a result. Nevertheless, this 
may be worthwhile under tight performance constraints, and indeed such configurability might be considered an 
advantage. 
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